Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)
The plot is an empty grey panel: ggplot() only sets up the coordinate system, and no layers (geoms) have been added to it yet.
How many rows are in mpg? How many columns?
mpg
## # A tibble: 234 x 11
##    manufac… model   displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>    <chr>   <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 audi     a4       1.80  1999     4 auto(… f        18    29 p     comp…
##  2 audi     a4       1.80  1999     4 manua… f        21    29 p     comp…
##  3 audi     a4       2.00  2008     4 manua… f        20    31 p     comp…
##  4 audi     a4       2.00  2008     4 auto(… f        21    30 p     comp…
##  5 audi     a4       2.80  1999     6 auto(… f        16    26 p     comp…
##  6 audi     a4       2.80  1999     6 manua… f        18    26 p     comp…
##  7 audi     a4       3.10  2008     6 auto(… f        18    27 p     comp…
##  8 audi     a4 qua…  1.80  1999     4 manua… 4        18    26 p     comp…
##  9 audi     a4 qua…  1.80  1999     4 auto(… 4        16    25 p     comp…
## 10 audi     a4 qua…  2.00  2008     4 manua… 4        20    28 p     comp…
## # ... with 224 more rows
234 rows and 11 columns in total.
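The same counts can be read off programmatically instead of from the printed header. A minimal sketch, assuming ggplot2 (which ships the mpg dataset) is installed; the variable names n_rows, n_cols, and dims are only illustrative:

```r
library(ggplot2)  # provides the mpg dataset

n_rows <- nrow(mpg)  # number of observations (cars)
n_cols <- ncol(mpg)  # number of variables
dims <- dim(mpg)     # both at once: c(rows, columns)
dims
```

dim() is handy when both numbers are needed at once; nrow() and ncol() read more clearly when only one is.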
What does the drv variable describe? Read the help for ?mpg to find out.
drv describes the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.
Make a scatterplot of hwy vs cyl.
ggplot(mpg, aes(x = hwy, y = cyl)) +
  geom_point()
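What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(mpg, aes(x = class, y = drv)) +
  geom_point()
Scatterplots are suited to continuous variables (e.g., cty and hwy). class and drv are both categorical, so every car falls on one of a few grid positions and the points overlap: the plot shows which combinations occur, but not how many cars each one contains.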
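What's gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
Inside aes(), "blue" is treated as a data value, not a colour: ggplot2 maps the constant string to the colour scale, so the points get the scale's default colour and a legend entry labelled "blue". To make the points blue, colour must be set manually, i.e., outside aes():
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")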
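The difference between mapping and setting a colour can be checked without looking at the plots. The sketch below assumes ggplot2 is installed and uses layer_data() to inspect the values each layer actually draws; the object names are hypothetical:

```r
library(ggplot2)

# Colour mapped inside aes(): "blue" is treated as a data value.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Colour set outside aes(): "blue" is used literally.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the data a layer draws, after scales are applied.
colours_mapped <- unique(layer_data(p_mapped, 1)$colour)
colours_set <- unique(layer_data(p_set, 1)$colour)
colours_mapped  # the colour scale's default, not "blue"
colours_set     # the literal colour that was set
```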
Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
mpg
## # A tibble: 234 x 11
##    manufac… model   displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>    <chr>   <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 audi     a4       1.80  1999     4 auto(… f        18    29 p     comp…
##  2 audi     a4       1.80  1999     4 manua… f        21    29 p     comp…
##  3 audi     a4       2.00  2008     4 manua… f        20    31 p     comp…
##  4 audi     a4       2.00  2008     4 auto(… f        21    30 p     comp…
##  5 audi     a4       2.80  1999     6 auto(… f        16    26 p     comp…
##  6 audi     a4       2.80  1999     6 manua… f        18    26 p     comp…
##  7 audi     a4       3.10  2008     6 auto(… f        18    27 p     comp…
##  8 audi     a4 qua…  1.80  1999     4 manua… 4        18    26 p     comp…
##  9 audi     a4 qua…  1.80  1999     4 auto(… 4        16    25 p     comp…
## 10 audi     a4 qua…  2.00  2008     4 manua… 4        20    28 p     comp…
## # ... with 224 more rows
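Categorical: manufacturer, model, trans, drv, fl, class
Continuous: displ, year, cty, hwy
cyl is stored as an integer but takes only four distinct values (4, 5, 6, 8), so it can reasonably be treated as either.
This information appears below each variable name in the output: <chr> marks a character vector (categorical), while <dbl> and <int> mark numeric columns.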
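The type row of the printed tibble can also be extracted directly, which is convenient for wider datasets. A minimal sketch, assuming ggplot2 is installed; col_types, categorical, and numeric_cols are illustrative names, and storage type is only a first approximation of whether a variable is categorical:

```r
library(ggplot2)

# Storage type of every column in mpg.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))

# Character columns are categorical; numeric and integer columns hold numbers.
categorical <- names(col_types)[col_types == "character"]
numeric_cols <- names(col_types)[col_types %in% c("numeric", "integer")]
categorical
numeric_cols
```

Note that cyl lands in numeric_cols because it is stored as an integer, even though it behaves like a category in most plots.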
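What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ))
The variable is shown twice. Here displ determines both the x position and the colour, so the colour gradient simply runs along the x axis. This is redundant, since the colour adds no new information.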
What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
stroke controls the border width of a point. It is most useful with shapes that have a border separate from their fill (shapes 21 to 25):
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), stroke = 2, shape = 23)
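What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
In this example, ggplot2 evaluates displ < 5 for every row and maps the resulting logical value to colour: observations with displ < 5 form one group (TRUE) and all remaining observations form a second group (FALSE).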
What happens if you facet on a continuous variable?
Each distinct value of the continuous variable is treated as a separate category and gets its own panel. This usually produces a very large number of panels. For example:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty, nrow = 2)
What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = cyl))
The empty cells are combinations of drv and cyl with no observations: there are no four-wheel or rear-wheel drive cars with 5 cylinders, and no rear-wheel drive cars with 4 cylinders. No cars have 7 cylinders, which is why that position on the y axis above is empty.
The plot above shows the same information as facet_grid(drv ~ cyl): positions without a point correspond to empty facets, but everything is displayed in one plot rather than in a grid of panels.
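The empty combinations can be confirmed with a cross-tabulation rather than by eye. This is a small sketch assuming ggplot2 is installed; combos, counts, and empty_cells are hypothetical names:

```r
library(ggplot2)

# Cross-tabulate drive train against number of cylinders.
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Combinations with a count of zero correspond to the empty facets.
counts <- as.data.frame(combos)  # columns: drv, cyl, Freq
empty_cells <- counts[counts$Freq == 0, c("drv", "cyl")]
empty_cells
```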
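What plots does the following code make? What does . do?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
. is a placeholder for "no variable" on that side of the formula. facet_grid(drv ~ .) facets by drv in rows only, and facet_grid(. ~ cyl) facets by cyl in columns only.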
Take the first faceted plot in this section:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
Faceting makes it easier to see the pattern within each group, and it avoids having to tell many similar colours apart. Its disadvantage is that the groups sit in separate panels, so it is harder to compare them directly and to see the overall pattern in the data. As the dataset becomes larger, points overlap more and colours become harder to separate, so faceting becomes preferable to the colour aesthetic for comparing groups.
Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
nrow and ncol set the number of rows and columns used to lay out the panels. Other options that control the layout include dir (fill the panels horizontally or vertically) and as.table. facet_grid() does not have nrow and ncol arguments because its layout is fixed by the data: the number of rows and columns equals the number of unique levels of the row and column variables.
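To make the contrast concrete, the sketch below builds one plot with each faceting function and counts the levels that fix the facet_grid() layout. It assumes ggplot2 is installed; base, wrapped, gridded, grid_rows, and grid_cols are illustrative names:

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# facet_wrap(): the shape of the panel grid is chosen with nrow / ncol.
wrapped <- base + facet_wrap(~ class, nrow = 2)

# facet_grid(): rows and columns follow the levels of drv and cyl.
gridded <- base + facet_grid(drv ~ cyl)

# Number of levels that determine the facet_grid() layout.
grid_rows <- length(unique(mpg$drv))
grid_cols <- length(unique(mpg$cyl))
c(rows = grid_rows, cols = grid_cols)
```

Printing wrapped and gridded shows the two layouts; only the first can be reshaped by changing nrow or ncol.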
When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
Most computer monitors are wider than they are tall, so putting the variable with more unique levels in the columns gives each panel more room and produces a more readable plot.
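What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
geom_line(), geom_boxplot(), geom_histogram(), and geom_area(), respectively.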
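Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
I did not predict three separate lines. However, color = drv is mapped globally in ggplot(), so it is passed to both geom_point() and geom_smooth(), and geom_smooth() fits one line per value of drv.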
What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class), show.legend = FALSE) +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
show.legend = FALSE removes the legend for class from the plot. If you remove show.legend = FALSE, the legend appears, because show.legend is NA by default for geom_point() (see ?geom_point), which means a legend is drawn whenever the layer maps an aesthetic.
What does the se argument to geom_smooth() do?
It controls whether the confidence interval is displayed around the smoothed line (TRUE by default).
level lets you specify the confidence level to use (0.95 by default).
Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
No, both will look the same because both layers use the same data and mappings. In the first example, they are specified once at the global level and inherited by each geom. In the second example, they are repeated at the local level for each geom.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3) +
  geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point(size = 3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv), size = 3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv), size = 3) +
  geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(mpg, aes(x = displ, y = hwy, fill = drv)) +
  geom_point(shape = 21, colour = "white", size = 3, stroke = 3)
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
The default geom associated with stat_summary() is geom_pointrange(). By default geom_pointrange() uses stat_identity, so it does not compute ymin or ymax itself. Switch it to stat = "summary" and supply the summary functions through fun.min, fun.max, and fun (named fun.ymin, fun.ymax, and fun.y before ggplot2 3.3.0):
ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.min = min,
                  fun.max = max,
                  fun = median)
What does geom_col() do? How is it different to geom_bar()?
geom_col() is used when the heights of the bars represent values already in the data. It uses stat_identity by default. In contrast, geom_bar() uses stat_count, which counts the number of rows at each x value before plotting.
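One way to see the relationship is to do the counting by hand and pass the result to geom_col(). The sketch below assumes ggplot2 is installed; cut_counts and the plot names are illustrative, and layer_data() is used only to compare the bar heights:

```r
library(ggplot2)

# geom_bar(): stat_count tallies the rows for each cut.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# geom_col(): the heights must already be in the data, so count first.
cut_counts <- as.data.frame(table(cut = diamonds$cut))  # columns: cut, Freq
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = Freq))

# Both layers end up drawing bars of the same height.
heights_bar <- layer_data(p_bar, 1)$y
heights_col <- layer_data(p_col, 1)$y
```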
Some examples:
geom_point – stat_identity
geom_line – stat_identity
geom_crossbar – stat_identity
geom_area – stat_identity
geom_boxplot – stat_boxplot
geom_violin – stat_ydensity
geom_histogram – stat_bin
stat_identity appears to be the most common default. However, geoms involving only one variable or a discrete variable tend to have a different default stat (e.g., geom_histogram).
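What variables does stat_smooth() compute? What parameters control its behaviour?
stat_smooth() computes y (the predicted value), ymin and ymax (the lower and upper bounds of the confidence interval), and se (the standard error). method determines which function is used to fit the model (e.g., lm or loess); formula, span, and level further control the fit and the width of the interval.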
In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
The problem with the above plots is that geom_bar() groups the data by x by default, so the proportions are calculated within each cut and every bar shows a proportion of 1.00. Setting group = 1 puts all observations in a single group, so each proportion is calculated relative to the whole dataset.
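The effect of group = 1 can be verified from the computed proportions. This sketch assumes ggplot2 3.3.0 or later, where after_stat(prop) is the current spelling of ..prop..; the object names are hypothetical:

```r
library(ggplot2)

# Without group = 1, each cut is its own group, so every proportion is 1.
p_wrong <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1, proportions are relative to all rows.
p_right <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# prop is one of the variables computed by stat_count.
props_wrong <- layer_data(p_wrong, 1)$prop
props_right <- layer_data(p_right, 1)$prop
props_wrong
props_right
```

The second vector sums to 1, which is what a proportion bar chart should show.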
What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()
Many points overlap each other (this is known as overplotting), because cty and hwy are recorded as whole numbers. To improve this plot, you could add a small amount of random noise to each point by using position = "jitter" or its shortcut, geom_jitter().
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = "jitter")
What parameters to geom_jitter() control the amount of jittering?
width and height, which set the amount of horizontal and vertical jitter.
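The sketch below shows width and height in action. It assumes ggplot2 is installed and uses position_jitter(), the position adjustment behind geom_jitter(), because it also accepts a seed that makes the noise reproducible; the values 0.3 and the object names are arbitrary choices for illustration:

```r
library(ggplot2)

# Jitter each point by at most 0.3 units horizontally and vertically.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 1))

jittered <- layer_data(p_jitter, 1)

# cty is stored as whole numbers, so the horizontal offset of each point
# is its distance to the nearest integer.
x_offset <- jittered$x - round(jittered$x)
range(x_offset)
```

Writing geom_jitter(width = 0.3, height = 0.3) gives the same kind of plot without a fixed seed.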
Compare and contrast geom_jitter() with geom_count().
Here is an example of geom_count():
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_count()
geom_count() is similar to geom_point(), but it counts the observations at each location and maps the count to point area. In contrast, geom_jitter() adds random noise to each point to prevent overplotting.
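What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
The default is position = "dodge" ("dodge2" in recent versions of ggplot2): boxplots that share an x value are placed side by side rather than on top of each other.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot(mapping = aes(colour = drv))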
Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(diamonds, aes(x = factor(1), fill = factor(cut))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")
As noted in the documentation for coord_polar(), these plots should be used with caution because polar coordinates have major perceptual problems.
What does labs() do? Read the documentation.
labs() allows you to modify axis, legend, and plot labels.
What’s the difference between coord_quickmap() and coord_map()?
coord_map() projects a portion of the earth onto a flat 2D plane. It does not, in general, preserve straight lines, so it requires considerable computation. coord_quickmap() is a quick approximation that does preserve straight lines and requires less computation.
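What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
There is a positive relationship between city and highway mpg.
coord_fixed() fixes the aspect ratio so that one unit on the x axis is the same length as one unit on the y axis, which makes the reference line run at 45 degrees.
geom_abline() adds a reference line. With its default arguments (slope 1, intercept 0) this is the line hwy = cty. The points lie above it, which shows that cars get more miles per gallon on the highway than they do in the city.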
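The reading of the plot can be backed up numerically. A minimal sketch, assuming ggplot2 is installed for the mpg data; mpg_gap and share_above are illustrative names:

```r
library(ggplot2)

# Difference between highway and city mileage for every car.
mpg_gap <- mpg$hwy - mpg$cty

# Share of cars that sit on or above the hwy = cty reference line.
share_above <- mean(mpg_gap >= 0)

# Strength of the linear relationship between the two variables.
cty_hwy_cor <- cor(mpg$cty, mpg$hwy)

summary(mpg_gap)
share_above
cty_hwy_cor
```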